Abstract: Given low order moment information over the random variables $X = (X_1, X_2, \ldots, X_p)$ and $Y$, what distribution minimizes the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation coefficient between $X$ and $Y$ while remaining faithful to the given moments? The answer to this question is especially important for fitting models over $(X, Y)$ with minimum dependence between the random variables $X$ and $Y$. In this paper, we
investigate this question first in the continuous setting by 
showing that the jointly Gaussian distribution achieves the 
minimum HGR correlation coefficient among distributions 
with the given first and second order moments. Then, 
we pose a similar question in the discrete scenario by 
fixing the pairwise marginals of the random variables X 
and Y. Subsequently, we derive a lower bound for the 
HGR correlation coefficient over the class of distributions 
with fixed pairwise marginals. Then we show that this 
lower bound is tight if there exists a distribution with 
certain additive structure satisfying the given pairwise 
marginals. Moreover, the distribution with the additive
structure achieves the minimum HGR correlation coefficient.
Finally, we conclude by showing that the event
of obtaining pairwise marginals containing an additive 
structured distribution has a positive Lebesgue measure 
over the probability simplex. 

I. Introduction 

A well-known measure of dependence between two random variables $X$ and $Y$ is the Pearson correlation coefficient, which is defined as

$$\rho(X,Y) \triangleq \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}},$$

assuming that $0 < \mathrm{Var}(X), \mathrm{Var}(Y) < \infty$. Clearly, this measure of dependence is zero if $X$ and $Y$ are independent, while the converse is not true in general. Furthermore, this measure of dependence fails to detect nonlinear dependence among random variables. A closely related measure of dependence, which was first introduced by Hirschfeld and Gebelein [1], [2] and then studied by Rényi [3], is the HGR maximal correlation coefficient, defined as

$$\rho_m(X,Y) \triangleq \sup_{f,g}\ \mathbb{E}[f(X)\,g(Y)], \tag{1}$$
where the maximization is taken over the class of all measurable functions $f$ and $g$ with the property that $\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0$ and $\mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1$. The HGR correlation coefficient has many natural properties one would look for in a measure of dependence. For example, the HGR correlation coefficient of two random variables $X$ and $Y$ is normalized to be between 0 and 1. Furthermore, this coefficient is zero if and only if the two random variables are independent, and it is one if there is a strict dependence between $X$ and $Y$; see [3] for other interesting properties of the HGR correlation.
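To make the contrast between the two measures concrete, the following is a minimal sketch for finite alphabets. It relies on a standard fact that is not stated in this paper: for discrete $X$ and $Y$ with strictly positive marginals, the HGR correlation equals the second largest singular value of the matrix with entries $P(x,y)/\sqrt{P(x)P(y)}$. The function names `pearson` and `hgr_discrete` and the toy distribution ($X$ uniform on $\{-1,0,1\}$, $Y = X^2$) are illustrative assumptions.

```python
import numpy as np

def pearson(P, xs, ys):
    """Pearson correlation of (X, Y) with joint table P[x, y] and value grids xs, ys."""
    P, xs, ys = np.asarray(P, float), np.asarray(xs, float), np.asarray(ys, float)
    px, py = P.sum(axis=1), P.sum(axis=0)
    mx, my = xs @ px, ys @ py
    cov = xs @ P @ ys - mx * my
    vx = (xs ** 2) @ px - mx ** 2
    vy = (ys ** 2) @ py - my ** 2
    return cov / np.sqrt(vx * vy)

def hgr_discrete(P):
    """HGR maximal correlation of a finite joint table (assumes positive marginals
    and at least two symbols in each alphabet)."""
    P = np.asarray(P, float)
    px, py = P.sum(axis=1), P.sum(axis=0)
    B = P / np.sqrt(np.outer(px, py))
    s = np.linalg.svd(B, compute_uv=False)  # sorted in decreasing order
    return s[1]                             # s[0] is always 1

# Assumed toy example: X uniform on {-1, 0, 1} and Y = X^2.
xs, ys = [-1.0, 0.0, 1.0], [0.0, 1.0]
P = np.array([[0.0, 1 / 3],
              [1 / 3, 0.0],
              [0.0, 1 / 3]])
rho_pearson = pearson(P, xs, ys)
rho_hgr = hgr_discrete(P)
```

In this example $Y$ is a deterministic but nonlinear function of $X$, so the Pearson coefficient vanishes while the HGR coefficient is one.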

In the classical prediction problems the task is to 
predict the value of a random variable Y based on 
the observations on the random variable X. When the 
probability model on the random variables X and Y 
is not known, the first step in prediction is to fit a 
probabilistic model on the random variables based on 
the knowledge obtained from training data or other 
sources. A popular approach to fit a model for inference/prediction
is to use the maximum entropy principle
a. This principle states that given a set of constraints 
on the ground truth distribution, the distribution with the 
maximum (Shannon) entropy under those constraints 
is a "proper" representative of the class. In practice,
this idea can be implemented by estimating low order
marginals from data and finding a distribution with
maximum entropy satisfying the low order marginals.
The maximum entropy principle is in essence the spirit
of the variational method in graphical models. Applying
a similar idea to the prediction problem, one approach
is to find a distribution that maximizes the conditional
entropy of the random variable $Y$ given $X$ under a fixed
set of marginals. Given the marginal distribution of $Y$,
this principle is equivalent to minimizing the mutual
information between the two random variables $X$ and
$Y$; see, e.g., a, for which no efficient computational
approach is known in the literature for high dimensional 
problems. When HGR correlation is used instead of 
mutual information, the problem of interest is to find 
the distribution over (X, Y) with the minimum HGR 
correlation between the two random variables X and 
Y which stays faithful to the estimated low order 
marginals. 





To answer the question of finding the distribution
with minimum HGR correlation coefficient, we start
with the trivial inequality $\rho_m(X,Y) \ge |\rho(X,Y)|$, which
holds for any two random variables $X$ and $Y$. Interestingly,
this inequality is tight when $X$ and $Y$ are
jointly Gaussian ID, |[3l. In other words, among all 
possible distributions with fixed first and second order 
moments on $X$ and $Y$, the jointly Gaussian distribution
on $(X,Y)$ achieves the minimum HGR correlation.
A similar question arises in the case where the observed
random variable in prediction is in vector form: Let
$X = (X_1, X_2, \ldots, X_p)$ be the observed random vector
and Y be the random variable denoting the prediction 
target. Since in many recent prediction problems in 
statistics and machine learning, the random vector X is 
in a large dimensional space, estimating the complete 
joint distribution of X and Y is not computationally 
and statistically feasible from the data. Motivated by 
the variational approach in graphical models, one can 
estimate the first and second order moments of (X, Y) 
and fit a model which minimizes the dependence between
$X$ and $Y$ while staying faithful to the constraints
on the first and second order moments. Therefore, the
question of interest in this paper is as follows:

Question: Given the first and second order moments of
the two random variables $X = (X_1, \ldots, X_p)$ and $Y$,
what is the distribution with minimum HGR correlation
coefficient between $X$ and $Y$?

In this work, we investigate this question in both
continuous and discrete scenarios. In particular, in
Section II we answer this question for the continuous case
where both of the random variables $X$ and $Y$ are continuous.
Then, we cast a similar question in the discrete
setting and partially answer it by first finding a lower 
bound for the minimum HGR correlation coefficient. 
Then, we show that, under a certain additive structure 
condition, this lower bound is tight and can be achieved 
through a certain probability distribution satisfying the 
pairwise constraints. 

II. Minimum HGR Correlation Distribution: Continuous Setting

Let us assume that the first order moment $\mathbb{E}[(X\ Y)] = \mu \in \mathbb{R}^{p+1}$ and the second order moment $\mathbb{E}[(X\ Y)^T (X\ Y)] = \Lambda \in \mathbb{R}^{(p+1)\times(p+1)}$ are given. Our goal is to find the probability distribution $P_{X,Y}(x,y)$ on the random variables $X$ and $Y$ which is the solution to the following optimization problem:

$$\min_{P_{X,Y}}\ \rho_m(X,Y) \quad \text{s.t.} \quad \mathbb{E}_P[(X\ Y)] = \mu, \quad \mathbb{E}_P[(X\ Y)^T (X\ Y)] = \Lambda. \tag{2}$$

The following simple result shows that the jointly Gaussian distribution is always a solution of (2). The result and its proof are straightforward, but since we could not find it in the literature, we state it here. This result, together with the additive property of the Gaussian distribution, which will be explained later, sheds light on the discrete scenario as well.

Theorem 1. Let $P^G_{X,Y}$ be the Gaussian probability distribution defined over $X$ and $Y$ with the given first and second order moments $\mu$ and $\Lambda$. Then $P^G_{X,Y}$ is a minimizer of (2), i.e.,

$$P^G_{X,Y} \in \arg\min_{P_{X,Y}}\ \rho_m(X,Y) \quad \text{s.t.} \quad \mathbb{E}_P[(X\ Y)] = \mu, \quad \mathbb{E}_P[(X\ Y)^T (X\ Y)] = \Lambda.$$

Proof: Under the given first and second order moments, the random variable $Y$ can be written as

$$Y - \mu_Y = a^T (X - \mu_X) + Z, \tag{3}$$

for some vector $a \in \mathbb{R}^p$, with the random variable $Z$ being uncorrelated with $X$. Here $\mu_Y$ and $\mu_X$ are the given first order moments of $Y$ and $X$, respectively. Notice that the vector $a$, and thus $\mathrm{Var}(a^T X)$, are completely determined by the first and second order moments $\mu$ and $\Lambda$. Therefore, taking the functions $f$ and $g$ to be

$$f(X) = \frac{a^T (X - \mu_X)}{\sqrt{\mathrm{Var}(a^T X)}}, \qquad g(Y) = \frac{Y - \mu_Y}{\sqrt{\mathrm{Var}(Y)}} \tag{4}$$

in the HGR correlation definition (1) leads to the following lower bound for the minimum value of (2):

$$\rho_m(X,Y) \ \ge\ \mathbb{E}[f(X)\,g(Y)] = \sqrt{\frac{\mathrm{Var}(a^T X)}{\mathrm{Var}(Y)}}. \tag{5}$$

Hence the value of the obtained lower bound (5) only depends on the first and second order moments of $(X,Y)$. Now we show that the jointly Gaussian distribution with the given first and second order moments achieves the above lower bound. Under the jointly Gaussian distribution, the random variable $Z$ in (3) becomes independent of $X$, and therefore

$$(X_1, \ldots, X_p) \ -\ a^T X \ -\ Y$$

forms a Markov chain. Hence, using the alternative conditional-expectation characterization of the HGR correlation due to Rényi [3], we obtain

$$\rho_m^2(X;Y) = \max_g\ \mathbb{E}\big[\mathbb{E}^2(g(Y) \mid X)\big] = \max_g\ \mathbb{E}\big[\mathbb{E}^2(g(Y) \mid X, a^T X)\big] = \max_g\ \mathbb{E}\big[\mathbb{E}^2(g(Y) \mid a^T X)\big] = \rho_m^2(a^T X; Y),$$

where the maximizations are taken over the functions $g(\cdot)$ satisfying $\mathbb{E}(g(Y)) = 0$ and $\mathbb{E}(g^2(Y)) = 1$. Clearly, $(a^T X, Y)$ is distributed according to a jointly Gaussian distribution, for which the HGR correlation equals the absolute value of the Pearson correlation. Thus,

$$\rho_m(X;Y) = \rho_m(a^T X; Y) = \sqrt{\frac{\mathrm{Var}(a^T X)}{\mathrm{Var}(Y)}},$$

which completes the proof. ■
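As a small numerical illustration of the bound $\sqrt{\mathrm{Var}(a^T X)/\mathrm{Var}(Y)}$ appearing in the proof, the sketch below computes the regression vector $a$ and the bound directly from the moments. The function name `gaussian_min_hgr` and the example moments are assumptions made for illustration, and the covariance of $X$ is assumed to be invertible.

```python
import numpy as np

def gaussian_min_hgr(mu, Lam):
    """Return (a, bound) from the first moment mu of (X, Y) and the second
    moment Lam = E[(X Y)^T (X Y)]; Y is the last coordinate."""
    mu, Lam = np.asarray(mu, float), np.asarray(Lam, float)
    cov = Lam - np.outer(mu, mu)          # covariance matrix of (X, Y)
    Sxx, Sxy, var_y = cov[:-1, :-1], cov[:-1, -1], cov[-1, -1]
    a = np.linalg.solve(Sxx, Sxy)         # linear regression coefficients of Y on X
    var_aX = a @ Sxy                      # equals a^T Sxx a
    return a, np.sqrt(var_aX / var_y)

# Assumed example: zero mean, identity covariance for X, unit variance for Y.
mu = np.zeros(3)
Lam = np.array([[1.0, 0.0, 0.3],
                [0.0, 1.0, 0.4],
                [0.3, 0.4, 1.0]])
a, bound = gaussian_min_hgr(mu, Lam)
```

For these moments the regression vector is $a = (0.3, 0.4)$ and the minimum HGR correlation is $\sqrt{0.25} = 0.5$.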
Notice that the Gaussian distribution has the property that $\mathbb{E}[Y \mid X] = \sum_{i=1}^{p} f_i(X_i)$ for some linear functions $f_i$. As will be seen in the next section, this
additive model property 0 , a, nni plays an important 
role in the discrete scenario as well. 


III. Minimum HGR Correlation Distribution: Discrete Setting

Let us consider the scenario where the random variables $X$ and $Y$ are discrete. Here, motivated by the standard binary classification problem in machine learning, we assume that the random variable $X = (X_1, X_2, \ldots, X_p) \in \mathcal{X}^p$ is categorical with $|\mathcal{X}| < \infty$, while the random variable $Y \in \mathcal{Y} = \{0,1\}$ is binary. Since the random variables are categorical, the relabeling of the alphabets $\mathcal{X}$ and $\mathcal{Y}$ should not
affect any model fitting approach. Noticing that the first 
and second order moments of (X, Y) are not invariant 
to relabeling of the alphabets, we consider the class 
of distributions with a fixed given pairwise marginal 
distribution of $(X_1, \ldots, X_p, Y)$ instead of fixing the
moments. This modification is equivalent to fixing the
first and second order moments of the indicator/dummy
variables im defined over our categorical data. More 
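precisely, defining the random vector $\tilde{X} = (\tilde{X}_i^x)_{i \in \{1,\ldots,p\},\, x \in \mathcal{X}}$ with $\tilde{X}_i^x = 1$ if $X_i = x$ and $\tilde{X}_i^x = 0$ otherwise, the first and second order moment knowledge on $\tilde{X}$ is equivalent to the pairwise marginals knowledge on $X_1, \ldots, X_p$. Therefore, finding the distribution with the minimum HGR correlation coefficient between $X$ and $Y$ can be formally stated as

$$\min_{P_{X,Y}}\ \rho_m(X,Y) \quad \text{s.t.} \quad P_{X,Y} \in \mathcal{C}, \tag{6}$$

where $\mathcal{C}$ is the class of distributions with given pairwise marginals defined as

$$\mathcal{C} \triangleq \Big\{ P_{X,Y} \ \Big|\ \mathbb{P}(X_i = x_i, X_j = x_j) = \bar{p}^{\,i,j}_{x_i,x_j},\ \ \mathbb{P}(X_i = x_i, Y = y) = \bar{p}^{\,i}_{x_i,y},\ \ \forall x_i, x_j \in \mathcal{X},\ \forall y \in \mathcal{Y},\ \forall i,j \Big\}.$$

Here $\bar{p}^{\,i,j}$ and $\bar{p}^{\,i}$ denote the given pairwise marginals.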


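The optimization problem (6) is convex in terms of the joint distribution $P_{X,Y}$; however, the number of variables grows exponentially in $p$. To deal with this exponential computational complexity and find the solution of (6) indirectly, let us first define a lower bound on the HGR correlation coefficient between the two random variables $X$ and $Y$ with a given joint probability distribution $P_{X,Y}$:

$$\rho_m^s(X,Y) \triangleq \max_{f,g}\ \mathbb{E}[f(X)\,g(Y)] \quad \text{s.t.} \quad f \in \mathcal{S}, \quad \mathbb{E}_P[f(X)] = \mathbb{E}_P[g(Y)] = 0, \quad \mathbb{E}_P[f^2(X)] = \mathbb{E}_P[g^2(Y)] = 1, \tag{7}$$

where $\mathcal{S}$ is the class of separable functions defined as

$$\mathcal{S} \triangleq \Big\{ f \ \Big|\ f(X) = \sum_{i=1}^{p} e_i(X_i) \ \text{ with } e_i : \mathcal{X} \to \mathbb{R} \Big\}.$$

Clearly, $\rho_m^s(X,Y) \le \rho_m(X,Y)$ since $\rho_m^s(X,Y)$ is obtained by restricting the feasible set in (1). The following theorem shows that the value of $\rho_m^s(X,Y)$ is efficiently computable.

Theorem 2. Suppose $\mathcal{C} \ne \emptyset$ and let us without loss of generality assume that $\mathcal{X} = \{1, 2, \ldots, m\}$. Then

$$\rho_m^s(X,Y) = \sqrt{1 - \frac{\gamma^*}{\mathbb{P}(Y=0)\,\mathbb{P}(Y=1)}}, \tag{8}$$

where

$$\gamma^* = \min_{z \in \mathbb{R}^{pm \times 1}}\ z^T Q z + d^T z + \frac{1}{4}, \tag{9}$$

with $Q \in \mathbb{R}^{pm \times pm}$ and $d \in \mathbb{R}^{pm \times 1}$ defined as

$$Q_{m(i-1)+k,\ m(j-1)+\ell} = \mathbb{P}(X_i = k, X_j = \ell),$$

$$d_{m(i-1)+k} = \mathbb{P}(X_i = k, Y = 1) - \mathbb{P}(X_i = k, Y = 0),$$

for every $i,j = 1, \ldots, p$ and $k, \ell = 1, \ldots, m$.

Proof: The proof is provided in the Appendix. ■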

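Remark 1. Assume the pairwise marginals are estimated from a given dataset containing $n$ data points. Let $w^j \in \{0,1\}^{pm \times 1}$ be the indicator vector of the $j$-th data point, with $(w^j)_{m(i-1)+k} = 1$ if $X_i = k$ in the $j$-th data point and $(w^j)_{m(i-1)+k} = 0$ otherwise. Define the vector $b \in \{-\tfrac{1}{2}, \tfrac{1}{2}\}^{n \times 1}$ with $b_j = \tfrac{1}{2}$ if the random variable $Y = 1$ in the $j$-th data point and $b_j = -\tfrac{1}{2}$ otherwise. Then the optimization problem (9) is equivalent to the following least squares regression problem

$$\min_{z}\ \|Wz - b\|_2^2,$$

where $W = [w^1\ w^2\ \cdots\ w^n]^T$. Indeed, $\tfrac{1}{n}\|Wz - b\|_2^2 = z^T Q z - d^T z + \tfrac{1}{4}$ for the empirical $Q$ and $d$, which coincides with the objective of (9) after the change of variable $z \to -z$; hence the optimal value of the least squares problem equals $n\gamma^*$.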
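The least squares formulation of Remark 1 together with formula (8) of Theorem 2 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `separable_hgr_lower_bound` is hypothetical, categories are assumed to be encoded as integers $0, \ldots, m-1$, and both labels are assumed to occur in the data.

```python
import numpy as np

def separable_hgr_lower_bound(X, y, m):
    """Lower bound of Theorem 2 computed from data.
    X: (n, p) integer array with entries in {0, ..., m-1}; y: (n,) array in {0, 1}."""
    X, y = np.asarray(X), np.asarray(y)
    n, p = X.shape
    # One-hot design matrix W whose rows are the indicator vectors w^j.
    W = np.zeros((n, p * m))
    rows = np.repeat(np.arange(n), p)
    cols = (np.arange(p) * m + X).ravel()
    W[rows, cols] = 1.0
    b = np.where(y == 1, 0.5, -0.5)
    z, *_ = np.linalg.lstsq(W, b, rcond=None)   # minimum-norm least squares solution
    gamma = np.mean((W @ z - b) ** 2)           # gamma* for the empirical marginals
    p1 = y.mean()
    p0 = 1.0 - p1
    return np.sqrt(max(0.0, 1.0 - gamma / (p0 * p1)))

# Assumed toy datasets.
rho_indep = separable_hgr_lower_bound([[0], [0], [1], [1]], [0, 1, 0, 1], m=2)  # Y independent of X
rho_copy = separable_hgr_lower_bound([[0], [0], [1], [1]], [0, 0, 1, 1], m=2)   # Y = X_1
X_xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
rho_xor = separable_hgr_lower_bound(X_xor, [0, 1, 1, 0], m=2)                   # Y = X_1 XOR X_2
```

In the last dataset $Y$ is a deterministic function of $X$, so the true HGR correlation is one, yet no separable function correlates with $Y$ and the lower bound is zero. This is consistent with the bound depending only on pairwise marginals.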


Theorem 2 simply states that the lower bound $\rho_m^s(X,Y)$ is easily computable by solving a convex optimization problem; see (8), (9). Moreover, this lower bound only depends on the pairwise marginals of the distribution $P_{X,Y}$. Consequently, $\rho_m^s(X,Y)$ is the same across all distributions in $\mathcal{C}$, and therefore it is well-defined to denote by $\rho_m^s$ the lower bound $\rho_m^s(X,Y)$ achieved by any of the distributions in $\mathcal{C}$. Furthermore, this quantity is also a lower bound for the optimum value of (6), i.e.,

$$\rho_m^s \ \le\ \min_{P_{X,Y} \in \mathcal{C}}\ \rho_m(X,Y).$$

The following theorem provides an interpretable necessary and sufficient condition under which this lower bound is tight. Subsequently, in Theorem 4 we provide a computationally efficient approach to verify this condition.
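Theorem 3. The achieved lower bound $\rho_m^s$ is tight, i.e.,

$$\rho_m^s = \min_{P_{X,Y} \in \mathcal{C}}\ \rho_m(X,Y),$$

if and only if there exists a probability distribution $P \in \mathcal{C}$ for which the conditional expectation $\mathbb{E}_P[Y \mid X]$ takes a separable form, i.e., $f_i$'s exist such that

$$\mathbb{E}_P[Y \mid X] = \sum_{i=1}^{p} f_i(X_i). \tag{10}$$

Proof: First, consider the probability distribution $P$ for which $\mathbb{E}_P[Y \mid X] = \sum_{i=1}^{p} f_i(X_i)$. Notice that, when $Y$ is binary, the function $g(Y) = \frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}}$ is the only feasible function $g(\cdot)$ in (1). Therefore, the HGR correlation coefficient between $X$ and $Y$ can be calculated by

$$\rho_m(X,Y) = \max_f\ \mathbb{E}\left[f(X)\, \frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}}\right] \quad \text{s.t.} \quad \mathbb{E}[f(X)] = 0, \quad \mathbb{E}[f^2(X)] = 1, \tag{11}$$

where the expectations are taken with respect to the probability distribution $P$. Furthermore, the objective can be rewritten as

$$\mathbb{E}\left[f(X)\, \frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}}\right] = \mathbb{E}\left[\mathbb{E}\left[f(X)\, \frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}} \,\Big|\, X\right]\right] = \mathbb{E}\left[f(X)\, \mathbb{E}\left[\frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}} \,\Big|\, X\right]\right] = \mathbb{E}\left[f(X)\, \frac{\sum_{i=1}^{p} f_i(X_i) - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}}\right]. \tag{12}$$

A simple application of the Cauchy-Schwarz inequality implies that the optimizer of (11) is of the form

$$f^*(X) = \frac{\sum_{i=1}^{p} f_i(X_i) - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}\big(\sum_{i=1}^{p} f_i(X_i)\big)}},$$

which is in a separable form. Therefore, $f^*(\cdot)$ is feasible for (7) and $\rho_m^s(X,Y) = \rho_m(X,Y)$.

To show the other direction, notice that the above Cauchy-Schwarz inequality holds with equality if and only if, for a constant $c$,

$$f(X) = c\, \mathbb{E}\left[\frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}} \,\Big|\, X\right] \tag{13}$$

with probability one. Hence, if $\rho_m(X,Y) = \rho_m^s(X,Y)$, there is a separable function $f^*(X)$, a probability distribution $P^* \in \mathcal{C}$, and a constant $c^*$ such that

$$f^*(X) = c^*\, \mathbb{E}_{P^*}\left[\frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}} \,\Big|\, X\right], \tag{14}$$

which implies that $\mathbb{E}_{P^*}[Y \mid X]$ is a separable function of the $X_i$'s. ■

In the following theorem, we introduce another necessary and sufficient condition under which the lower bound becomes tight. Before stating the result, let us define the convex function $h(z) : \mathbb{R}^{pm \times 1} \to \mathbb{R}$ as

$$h(z) \triangleq \sum_{i=1}^{p} \max\big\{ z_{(i-1)m+1},\ z_{(i-1)m+2},\ \ldots,\ z_{im} \big\}.$$

Theorem 4. Assume $\mathcal{C} \ne \emptyset$. Then, the lower bound $\rho_m^s$ is tight if and only if there exists a solution $z^*$ to (9) satisfying $h(z^*) \le 1/2$ and $h(-z^*) \le 1/2$. In other words, if and only if the following identity holds:

$$\gamma^* = \min_{z}\ z^T Q z + d^T z + \frac{1}{4} \quad \text{s.t.} \quad h(z) \le 1/2 \ \text{ and } \ h(-z) \le 1/2. \tag{15}$$

Proof: The proof can be found in the Appendix. ■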
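The condition of Theorem 4 involves the function $h$, which sums the largest entry of each length-$m$ block of $z$. The sketch below evaluates $h$ and tests the condition on one particular minimizer of (9), the minimum-norm one, $z = -\tfrac{1}{2} Q^{\dagger} d$. This is only a sufficient check under that assumption: the theorem asks for some minimizer satisfying the constraints, which in general requires solving the constrained problem (15). The names `h` and `min_norm_check` are hypothetical.

```python
import numpy as np

def h(z, p, m):
    """Sum over the p blocks of the largest entry in each length-m block of z."""
    return np.asarray(z, float).reshape(p, m).max(axis=1).sum()

def min_norm_check(Q, d, p, m, tol=1e-9):
    """Test h(z) <= 1/2 and h(-z) <= 1/2 at the minimum-norm minimizer of (9).
    A True answer certifies tightness; False is inconclusive."""
    z = -0.5 * np.linalg.pinv(Q) @ d
    ok = (h(z, p, m) <= 0.5 + tol) and (h(-z, p, m) <= 0.5 + tol)
    return ok, z

# Assumed example: uniform distribution with p = 2 binary features (m = 2).
Q_unif = np.array([[0.50, 0.00, 0.25, 0.25],
                   [0.00, 0.50, 0.25, 0.25],
                   [0.25, 0.25, 0.50, 0.00],
                   [0.25, 0.25, 0.00, 0.50]])
d_unif = np.zeros(4)
ok_unif, z_unif = min_norm_check(Q_unif, d_unif, p=2, m=2)
h_example = h([0.1, 0.3, -0.2, 0.0], p=2, m=2)  # block maxima are 0.3 and 0.0
```

For the uniform distribution $d = 0$, so the minimizer is $z = 0$ and the condition holds trivially.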

Notice that in light of Theorems 3 and 4 we can identify in polynomial time whether there exists a probability distribution $P \in \mathcal{C}$ with an additive model structure. Now it is interesting to investigate whether the conditions in Theorem 4 are satisfied for most classes, or hold only for a negligible class of distributions. To formally study this question, let us start from a probability distribution $P_0$. Since $X \in \mathcal{X}^p$ with $|\mathcal{X}| = m$ and $Y \in \{0,1\}$, this probability distribution can be uniquely identified by a vector $p_0 \in \mathbb{R}^{2m^p \times 1}$ with $\mathbf{1}^T p_0 = 1$, where $\mathbf{1}$ is the all-ones vector. Define $\mathcal{C}(p_0)$ to be the class of probability distributions having the same pairwise marginals as $p_0$. With this definition in hand, the following result follows:

Theorem 5. For the uniform probability vector $\bar{p}$, there exists an $\epsilon > 0$ such that for any probability vector $p$ with $\|p - \bar{p}\|_1 < \epsilon$, the class $\mathcal{C}(p)$ contains an additive distribution, i.e., $\exists\, P \in \mathcal{C}(p)$ with $\mathbb{E}_P[Y \mid X] = \sum_{i=1}^{p} f_i(X_i)$ for some functions $f_i : \mathcal{X} \to \mathbb{R}$, $i = 1, \ldots, p$.

Proof: Let $\bar{p}$ be the uniform distribution over $(X, Y)$, i.e., all $(x, y)$'s have the same probability. Define $\bar{Q}$ and $\bar{d}$ to be the matrix and vector in (9) obtained for this uniform distribution. Clearly, $\bar{d} = 0$, and therefore the objective of (9) is minimized at $z = 0$, for which $h(z) = h(-z) = 0 \le 1/2$.

Note that for any probability distribution and for any $i = 1, \ldots, p$, the columns $m(i-1)+1, \ldots, mi$ of the defined matrix $Q$ sum to the same vector, namely the vector of single-variable marginals $\mathbb{E}[w]$. Therefore, $\mathrm{rank}(Q) \le (m-1)p + 1$. Furthermore, it is easy to check that for the vector $\bar{p}$, $\mathrm{rank}(\bar{Q})$ achieves this upper bound. The reason is that if we subtract the column $mi + j$ from the column $mi + j - 1$ for any $j = 2, \ldots, m$ and $i = 0, \ldots, p-1$, we obtain a vector taking the value $\tfrac{1}{m}$ at position $mi + j - 1$, $-\tfrac{1}{m}$ at position $mi + j$, and zero elsewhere, which leads to $(m-1)p$ independent vectors. Also note that the entries of any linear combination of these vectors sum to zero, whereas the entries of a column of $\bar{Q}$ do not, which shows that the dimension of the column space of $\bar{Q}$ is at least $(m-1)p + 1$, i.e., $\mathrm{rank}(\bar{Q}) = (m-1)p + 1$.

Since the function $\mathrm{rank}(\cdot)$ is lower semicontinuous, there exists an $\epsilon > 0$ for which $\mathrm{rank}(\bar{Q}) \le \mathrm{rank}(Q)$ for any $Q$ coming from a probability vector $p$ with $\|p - \bar{p}\|_1 < \epsilon$. Combining the fact that $\mathrm{rank}(Q) \le (m-1)p + 1$ for any $p$ with the lower semicontinuity of the $\mathrm{rank}(\cdot)$ function implies that $\mathrm{rank}(Q) = (m-1)p + 1$ for small enough $\epsilon$. Therefore, the Moore-Penrose pseudoinverse behaves continuously under
small perturbations in probability distribution; see da. 
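Thus, after possibly shrinking $\epsilon$, for any $p$ in the $\epsilon$-neighborhood of $\bar{p}$ we have that $\big|\, \|Q^{\dagger} d\|_1 - \|\bar{Q}^{\dagger} \bar{d}\|_1 \big| < 1/2$. Noticing that $\bar{Q}^{\dagger} \bar{d} = 0$ and that $\max\{h(z^*), h(-z^*)\} \le \|Q^{\dagger} d\|_1$ for the minimizer $z^* = -\tfrac{1}{2} Q^{\dagger} d$ of (9) completes the proof.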

■ 

The above result simply states that the set of distributions
$p$ leading to a pairwise class $\mathcal{C}(p)$ containing
an additive distribution has positive Lebesgue measure
over the simplex of probability vectors. A few remarks are
in order:

Remark 2. In the continuous case, the distribution with 
additive structure always exists in our fixed class of 
distributions since the jointly Gaussian distribution has 
additive form and there always exists a jointly Gaussian
distribution with any given valid first and second order
moments.

Remark 3. The proposed lower bound $\rho_m^s$ is not always equal to the optimal value of (6). Consider the binary valued random variables $X_1, X_2, Y$ coming from the joint distribution $P$ with

$$\mathbb{P}_{X_1,X_2,Y}(0,0,0) = 0, \qquad \mathbb{P}_{X_1,X_2,Y}(0,0,1) = 0.1,$$
$$\mathbb{P}_{X_1,X_2,Y}(1,0,0) = 0.2, \qquad \mathbb{P}_{X_1,X_2,Y}(1,0,1) = 0.2,$$
$$\mathbb{P}_{X_1,X_2,Y}(0,1,0) = 0.1, \qquad \mathbb{P}_{X_1,X_2,Y}(0,1,1) = 0.3,$$
$$\mathbb{P}_{X_1,X_2,Y}(1,1,0) = 0.1, \qquad \mathbb{P}_{X_1,X_2,Y}(1,1,1) = 0.$$

Now consider the class of distributions obtained by the pairwise marginals of $P$. Then for any distribution $\tilde{P}$ in our class, we have

$$\tilde{\mathbb{P}}_{X_1,X_2,Y}(0,0,0) + \tilde{\mathbb{P}}_{X_1,X_2,Y}(1,1,1) = \tilde{\mathbb{P}}_{X_1,X_2}(1,1) - \tilde{\mathbb{P}}_{X_1,Y}(1,0) + \tilde{\mathbb{P}}_{X_2,Y}(0,0)$$
$$= \mathbb{P}_{X_1,X_2}(1,1) - \mathbb{P}_{X_1,Y}(1,0) + \mathbb{P}_{X_2,Y}(0,0) = \mathbb{P}_{X_1,X_2,Y}(0,0,0) + \mathbb{P}_{X_1,X_2,Y}(1,1,1) = 0.$$

Since both $\tilde{\mathbb{P}}(X_1 = 0, X_2 = 0, Y = 0)$ and $\tilde{\mathbb{P}}(X_1 = 1, X_2 = 1, Y = 1)$ are non-negative, they must both be zero. Combining this fact and the fact that $P$ and $\tilde{P}$ have the same set of pairwise marginals implies that $\tilde{P} = P$. In other words, our class of distributions with the given pairwise marginals is a singleton. Furthermore, the conditional probability $\mathbb{P}(Y \mid X)$ is given by

$$\mathbb{P}(Y = 1 \mid X_1 = 0, X_2 = 0) = 1,$$
$$\mathbb{P}(Y = 1 \mid X_1 = 1, X_2 = 0) = 1/2,$$
$$\mathbb{P}(Y = 1 \mid X_1 = 0, X_2 = 1) = 3/4,$$
$$\mathbb{P}(Y = 1 \mid X_1 = 1, X_2 = 1) = 0.$$

Therefore, $\mathbb{E}(Y \mid X_1 = 1, X_2 = 1) + \mathbb{E}(Y \mid X_1 = 0, X_2 = 0) \ne \mathbb{E}(Y \mid X_1 = 1, X_2 = 0) + \mathbb{E}(Y \mid X_1 = 0, X_2 = 1)$, which means that our distribution does not have an additive form, i.e., there is no distribution with additive structure in our class. Hence, this example illustrates a scenario where the proposed lower bound is not tight.
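The arithmetic in this example can be checked numerically. The short sketch below, with illustrative variable names, stores the joint table indexed as `P[x1, x2, y]`, recomputes the conditional probabilities, and evaluates both the pairwise-marginal identity and the additivity gap.

```python
import numpy as np

# Joint table of the example, indexed as P[x1, x2, y].
P = np.zeros((2, 2, 2))
P[0, 0, 0], P[0, 0, 1] = 0.0, 0.1
P[1, 0, 0], P[1, 0, 1] = 0.2, 0.2
P[0, 1, 0], P[0, 1, 1] = 0.1, 0.3
P[1, 1, 0], P[1, 1, 1] = 0.1, 0.0

# Pairwise marginals.
P12 = P.sum(axis=2)   # joint of (X1, X2)
P1Y = P.sum(axis=1)   # joint of (X1, Y)
P2Y = P.sum(axis=0)   # joint of (X2, Y)

# P(0,0,0) + P(1,1,1) expressed through pairwise marginals only.
corner_mass = P12[1, 1] - P1Y[1, 0] + P2Y[0, 0]

# Conditional expectation E[Y | X1, X2] = P(Y = 1 | X1, X2).
cond = P[:, :, 1] / P12

# For two binary features, additivity holds iff this gap is zero.
additivity_gap = cond[1, 1] + cond[0, 0] - cond[1, 0] - cond[0, 1]
```

The corner mass evaluates to zero and the additivity gap to $-0.25$, in agreement with the remark.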
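IV. Appendix

A. Proof of Theorem 2

First let us define $w \in \{0,1\}^{pm \times 1}$ to be the indicator random vector for $X$, i.e.,

$$w_{m(i-1)+k} = \mathbb{1}(X_i = k) = \begin{cases} 1 & \text{if } X_i = k, \\ 0 & \text{otherwise.} \end{cases} \tag{17}$$

Clearly, every separable function of $X$ is a linear function of $w$, and therefore for every $f \in \mathcal{S}$ there exists a vector $a$ for which

$$f(X) = a^T w. \tag{18}$$

Also notice that, when $Y$ is binary, the variable $\tilde{Y} = \frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}}$ is the only feasible function of $Y$ in (7). Therefore,

$$\rho_m^s(X,Y) = \max_a\ \mathbb{E}\big[a^T w\, \tilde{Y}\big] \quad \text{s.t.} \quad \mathbb{E}_P[a^T w] = 0, \quad \mathbb{E}_P[a^T w w^T a] = 1. \tag{19}$$

Based on the definition of $Q$ and $d$, for any $P \in \mathcal{C}$ we have

$$Q = \mathbb{E}_P(w w^T), \qquad d' = \mathbb{E}_P(\tilde{Y} w), \tag{20}$$

where

$$d' = \frac{\tfrac{1}{2} d + \big(\tfrac{1}{2} - \mathbb{P}(Y = 1)\big)\, \mathbb{E}(w)}{\sqrt{\mathbb{P}(Y = 0)\, \mathbb{P}(Y = 1)}}. \tag{21}$$

First notice that there exists $u'$ for which $Q u' = d'$, since otherwise the objective function in (19) would not be bounded from above. Furthermore, $\mathbb{E}(w)$ is in the column space of $Q$, and therefore there exists $u$ with $Q u = d$. With this in mind, the optimization problem (19) can be rewritten as

$$\rho_m^s(X,Y) = \max_a\ a^T d' \quad \text{s.t.} \quad a^T \mathbb{E}[w] = 0, \quad a^T Q a \le 1. \tag{22}$$

So the Lagrangian would be

$$\mathcal{L}(a, \beta, \lambda) = a^T d' + \lambda (1 - a^T Q a) + \beta\, a^T \mathbb{E}[w] \tag{23}$$

and thus the Lagrange dual function is

$$h(\beta, \lambda) = \sup_a\ \mathcal{L}(a, \beta, \lambda) = \sup_a\ a^T d' + \lambda (1 - a^T Q a) + \beta\, a^T \mathbb{E}[w] = \lambda + \frac{1}{4\lambda} \big(d' + \beta \mathbb{E}[w]\big)^T Q^{\dagger} \big(d' + \beta \mathbb{E}[w]\big), \tag{24}$$

where $Q^{\dagger}$ denotes the Moore-Penrose pseudoinverse of $Q$. Note that $Q$ is not a full rank matrix and thus is not invertible. Here we have used that $Q$ is positive semidefinite and that $u'$ exists with $Q u' = d'$ to show that the above quadratic function in the supremum is upper-bounded, and therefore the last equality holds. In addition, notice that for any $v'$ where $Q v' = d'$, including $Q^{\dagger} d'$ since $u'$ exists with $Q u' = d'$, we have

$$Q v' = d' \ \Rightarrow\ \mathbb{E}[w w^T]\, v' = \mathbb{E}[\tilde{Y} w] \ \Rightarrow\ \mathbf{1}^T \mathbb{E}[w w^T]\, v' = \mathbf{1}^T \mathbb{E}[\tilde{Y} w] \ \Rightarrow\ \mathbb{E}[(\mathbf{1}^T w)\, w^T]\, v' = \mathbb{E}[\tilde{Y} (\mathbf{1}^T w)] \ \Rightarrow\ \mathbb{E}[w^T]\, v' = \mathbb{E}[\tilde{Y}] = 0, \tag{25}$$

where $\mathbf{1}$ is a vector with all entries equal to 1, and the last step uses $\mathbf{1}^T w = p$. Therefore,

$$\mathbb{E}[w]^T Q^{\dagger} d' = 0.$$

Similarly, we can show

$$\mathbb{E}[w]^T Q^{\dagger}\, \mathbb{E}[w] = 1, \tag{26}$$

$$\mathbb{E}[w]^T Q^{\dagger} d = \mathbb{P}(Y = 1) - \mathbb{P}(Y = 0). \tag{27}$$

Let $p_1 = \mathbb{P}(Y = 1)$ and $p_0 = \mathbb{P}(Y = 0)$. To solve the dual problem, we have

$$\rho_m^s(X,Y) = \min_{\beta,\ \lambda \ge 0}\ h(\beta, \lambda) = \min_{\beta}\ \min_{\lambda > 0}\ h(\beta, \lambda)$$
$$\overset{(a)}{=} \min_{\beta}\ \sqrt{\big(d' + \beta \mathbb{E}[w]\big)^T Q^{\dagger} \big(d' + \beta \mathbb{E}[w]\big)} = \sqrt{\min_{\beta}\ \big(d' + \beta \mathbb{E}[w]\big)^T Q^{\dagger} \big(d' + \beta \mathbb{E}[w]\big)}$$
$$\overset{(b)}{=} \sqrt{d'^T Q^{\dagger} d' + \min_{\beta}\ \beta^2} = \sqrt{d'^T Q^{\dagger} d'}$$
$$\overset{(c)}{=} \sqrt{\frac{d^T Q^{\dagger} d - 2(1 - 2p_1)^2 + (1 - 2p_1)^2}{4 p_0 p_1}} \overset{(d)}{=} \sqrt{\frac{1 - 4\gamma^* - (1 - 2p_1)^2}{4 p_0 p_1}} = \sqrt{1 - \frac{\gamma^*}{p_0 p_1}}. \tag{28}$$

Here (a) uses the fact that, for any $\eta > 0$, the minimum value of $\lambda + \frac{\eta}{4\lambda}$ over $\lambda \in \mathbb{R}_+$ is $\sqrt{\eta}$. (b) and (c) follow from $\mathbb{E}[w]^T Q^{\dagger} d' = 0$, (26), (27), and the fact that $p_0 + p_1 = 1$. (d) holds since, as shown above, $Q$ is a positive semidefinite matrix whose column space includes $d$, hence

$$\gamma^* = \frac{1}{4}\big(1 - d^T Q^{\dagger} d\big). \tag{29}$$

Therefore, the proof is complete.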
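B. Proof of Theorem 4

Similar to the proof provided for Theorem 2, let $w \in \{0,1\}^{pm \times 1}$ be the vector of indicator variables, i.e., $w_{(i-1)m+k} = \mathbb{1}(X_i = k)$ is the indicator variable for $X_i = k$. Define $\delta = \tfrac{1}{2} - Y$. It can be seen that

$$0 \le \mathbb{E}_P\big[(w^T z - \delta)^2\big] = z^T Q z + d^T z + \frac{1}{4}, \tag{30}$$

for any $P \in \mathcal{C}$. Therefore, assuming $\mathcal{C} \ne \emptyset$, the above quadratic function takes its minimum value at any $z$ satisfying

$$2 Q z + d = 0. \tag{31}$$

Due to Theorem 3, if $\rho_m^s$ becomes tight there exists a $P^* \in \mathcal{C}$ where

$$\mathbb{E}_{P^*}[Y \mid X] = \sum_{i=1}^{p} f_i^*(X_i). \tag{32}$$

Let $z^*$ be defined by $z^*_{(i-1)m+k} = \frac{1}{2p} - f_i^*(k)$, so that $z^{*T} w = \tfrac{1}{2} - \sum_{i=1}^{p} f_i^*(X_i)$. Then, there exist $x^*$ and $x^{**}$ for which

$$h(z^*) = \frac{1}{2} - \mathbb{E}_{P^*}[Y \mid X = x^*] = \frac{1}{2} - \mathbb{P}^*(Y = 1 \mid X = x^*) \le \frac{1}{2} \tag{33}$$

and

$$h(-z^*) = -\frac{1}{2} + \mathbb{E}_{P^*}[Y \mid X = x^{**}] = -\frac{1}{2} + \mathbb{P}^*(Y = 1 \mid X = x^{**}) \le \frac{1}{2}. \tag{34}$$

Note that, writing $f^*(x) = \sum_{i=1}^{p} f_i^*(x_i)$, the $((i-1)m+k)$-th entry of $2 Q z^* + d$ is

$$2 \sum_{x:\, x_i = k} \mathbb{P}^*(X = x)\Big(-f^*(x) + \frac{1}{2}\Big) + \mathbb{P}^*(X_i = k, Y = 1) - \mathbb{P}^*(X_i = k, Y = 0)$$
$$= \mathbb{P}^*(X_i = k) - 2 \sum_{x:\, x_i = k} \big[\mathbb{P}^*(X = x)\, \mathbb{E}_{P^*}[Y \mid X = x]\big] + \mathbb{P}^*(X_i = k, Y = 1) - \mathbb{P}^*(X_i = k, Y = 0)$$
$$= \mathbb{P}^*(X_i = k) - 2\, \mathbb{P}^*(X_i = k, Y = 1) + \mathbb{P}^*(X_i = k, Y = 1) - \mathbb{P}^*(X_i = k, Y = 0) = 0. \tag{35}$$

Therefore, $z^*$ satisfies the first order optimality condition of (9) with $h(z^*) \le 1/2$ and $h(-z^*) \le 1/2$. Consequently, (15) holds.

Now consider the converse direction. Assume there exists such a minimizer $z^*$. We can consider the joint distribution $P^*$ defined as

$$\mathbb{P}^*(X = x, Y = y) = \Big(\frac{1}{2} + (-1)^y\, z^{*T} w_x\Big)\, \mathbb{Q}(X = x), \tag{36}$$

where $w_x$ denotes the vector of indicator variables for $x$, and $\mathbb{Q}$ is a probability distribution in $\mathcal{C}$, which we supposed is not empty. Note that according to (36), since $h(z^*) \le 1/2$ and $h(-z^*) \le 1/2$ hold, $P^*$ is a valid joint distribution for which we can simply verify

$$\mathbb{P}^*(X_i = k, X_j = l) = \mathbb{Q}(X_i = k, X_j = l) = \mathbb{P}(X_i = k, X_j = l), \tag{37}$$

for every $i, j = 1, \ldots, p$ and $k, l = 1, \ldots, m$. Also, using the optimality condition (31),

$$\mathbb{P}^*(X_i = k, Y = 1) = \sum_{x:\, x_i = k} \Big(\frac{1}{2} - z^{*T} w_x\Big)\, \mathbb{Q}(X = x) = \frac{1}{2}\, \mathbb{P}^*(X_i = k) - (Q z^*)_{(i-1)m+k} = \frac{1}{2}\, \mathbb{P}^*(X_i = k) + \frac{1}{2}\, d_{(i-1)m+k} = \mathbb{P}(X_i = k, Y = 1), \tag{38}$$

and thus $P^* \in \mathcal{C}$. Notice that

$$\mathbb{P}^*(Y = 1 \mid X = x) = \frac{1}{2} - z^{*T} w_x, \tag{39}$$

which shows $P^*$ has a separable conditional expectation, and hence due to Theorem 3 the lower bound is tight.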